FOREWORD

This Indian Standard (Second Revision) was adopted by the Bureau of Indian Standards, after the draft finalized
by the Statistical Method for Quality and Reliability Sectional Committee had been approved by the Management 
and Systems Division Council. 

The study of the relationship between two variables is of fundamental importance in industry. For example, in 
the building industry, while studying the properties of cement, it may be necessary to estimate the effect of 
curing time on the compressive strength. In such problems, where one variable is of particular interest for studying 
the effect of the other variable on it, the concept of regression is quite useful. The regression technique is also 
helpful for the purpose of prediction. 

In some problems, the relationship between two variables may itself be of great interest. For example, in the case of
steel, tensile strength can be studied by means of a hardness test, as the latter has a strong relationship with the former.
The determination of the extent of relationship between two variables leads to the concept of correlation.

This standard was originally published in 1974 to cover the statistical methods of regression and correlation in 
the case of two variables. This standard was revised in 1995 to include the concept of 'scatter diagram' more 
elaborately. 

In view of the experience gained with the use of the standard in course of years, it was felt necessary to further 
revise it. In the revised version, following changes have been made: 

a) A table which gives the values of correlation coefficient (r) for different selected sample sizes has been 
included so that the sample correlation coefficient calculated value may directly be compared with this 
tabulated value to test whether the population correlation coefficient is zero or not, 

b) Confidence limits for the population regression line, with an example, have been included,

c) Many editorial corrections have been incorporated, and 

d) The concepts at many places have been elaborated for better understanding. 

The composition of the Committee responsible for the formulation of this standard is given in Annex F. 



IS 7300 : 2003 



Indian Standard 
METHODS OF REGRESSION AND CORRELATION 

( Second Revision ) 



1 SCOPE 

This standard covers the statistical methods of linear 
regression and correlation in the case of two variables. 
The computations have been illustrated with examples. 

2 REFERENCES 

The following standards contain provisions, which 
through reference in this text constitute provisions of 
this standard. At the time of publication, the editions 
indicated were valid. All standards are subject to 
revision and parties to agreements based on this 
standard are encouraged to investigate the possibility 
of applying the most recent editions of the standards 
indicated below: 

IS No.                  Title

6200 (Part 1) : 1995    Statistical tests of significance : Part 1 t-, Normal and F-tests
                        (second revision)
7920 (Part 1) : 1994    Statistical vocabulary and symbols : Part 1 Probability and general
                        statistical terms (second revision)
8900 : 1978             Criteria for the rejection of outlying observations
9300 (Part 1) : 1979    Statistical models for industrial applications : Part 1 Discrete models

3 TERMINOLOGY 

For the purpose of this standard the definitions given 
in IS 7920 (Part 1) shall apply. 

4 BASIC CONCEPTS 

4.1 Scatter Diagram

4.1.1 The scatter diagram is useful for detecting the presence
and the nature of a relationship between two variables, if any.
The relationship can be a cause and effect relationship, a
relationship between one cause and another, or a relationship
between one effect and another.

4.1.2 The scatter diagram can even be used by operators
to find the relationship between two variables, if any.
This may lead to appropriate action being taken for quality
improvement.

4.1.3 A scatter diagram is prepared by plotting the
paired data in an X-Y plane. It is desirable to have more
than 30 pairs of data. Of the two variables, one is said
to be independent and the other dependent, and it is
usual to regard the independent variable as x and the
dependent variable as y. Since the range of data varies
widely, starting the axes at zero is sometimes inconvenient
for preparing a well-balanced scatter diagram. The data
ranges are therefore presented on convenient scales so
that the spread is close to a square and large enough for
individual points to be distinguished.

4.1.4 The problem of outliers is encountered in the
actual preparation of scatter diagrams. Outliers are points
that are widely separated from the rest of the data set. If
there are few outliers, they should be eliminated from the data.
For guidance on the criteria for rejection of outliers,
reference is invited to IS 8900. If there are many
(generally more than 25 percent) outliers, the causes
for the same should be investigated and corrective
action taken. Thereafter, fresh data need to be
collected for plotting the scatter diagram.

4.1.5 Interpretation of a Scatter Diagram 

When a scatter diagram is prepared, it is important to 
interpret it accurately and take necessary measures. For 
this purpose, the scatter diagram should be carefully 
observed for the relationship between two variables. 
The interpretation of the scatter diagrams is explained 
as follows: 

a) Positive relationship — In a scatter diagram,
if y increases with increase in x, then the
relationship is said to be positive. When the
points are close to a straight line [see Fig. 1
(a)], the relationship is called a positive linear
relationship. Under such conditions control
on y (the dependent variable) can be achieved
by exercising control on x (the independent
variable).

b) Negative relationship — In a scatter diagram,
if y decreases with increase in x, then the
relationship is said to be negative [see Fig. 1
(b)]. In this case, similar interpretation as
given for (a) holds good.

c) Weak relationship — Sometimes the
relationship may not be as clearly evident as
in (a) or (b) [see Fig. 1 (c)]. Further
investigations may be required to find out the
reasons, if any, for the wider scatter. Possibly
one factor alone is not sufficient to explain
the relationship fully or there could be wide
measurement errors. The relationship may not
be useful for control purposes in such a
situation.

d) No relationship — In a scatter diagram [see
Fig. 1 (d)], no relationship can be noted
between x and y. If the presence of a
relationship is expected on technological
considerations, the causes/effects may be
examined from other viewpoints. In such a
situation the possibility of stratifying the data
may also be looked into [see 4.1.5 (e)].

e) Relationship revealed by stratification — The
scatter diagram [see Fig. 1 (e)] shows no
relationship at a glance, but if the data are
classified into different groups a
relationship may emerge. In this diagram,
the presence of a relationship can be confirmed
definitely by stratifying the data into three
groups, each marked with a different symbol.

f) Non-linear relationship — In a scatter
diagram there may be a relationship between x
and y that is non-linear. For example, in
Fig. 1 (f), y increases with an increase in x
until a certain point, but decreases with an
increase in x beyond that point. Such a
relationship is called a non-linear relationship
and has to be treated differently. In such a situation,
it is convenient to locate the optimal combination
of x and y.

g) Insufficient data range — When attention is
paid only to the separately marked points in
Fig. 1 (g), there seems to be no relationship
between x and y, but a positive linear
relationship is noted when the points are observed
over a somewhat wider range. Accordingly, it is
necessary to examine carefully the
appropriateness of the range of x even when
no relationship is suggested in the diagram
prepared for the first time.

[Figure: seven panels, (a) to (g), each plotting Y against X]
Fig. 1 Various Scatter Diagrams

4.2 Regression 

Regression deals with situations when one variable is 
dependent on the other variable. For example, the two 
variables may be the quantities of the carbon steel and 
alloy steel produced from the same raw material or 
charge, elongation of boiler plate and the amount of 
tension applied, amount of rainfall and the yield of a 
crop, and so on. Of the two variables, one is independent 
(generally measurable) and the other is dependent 
(desired to be controlled). Thus, it is evident that the 
production of alloy steel depends on the production of 
carbon steel so that the quantity of carbon steel produced 
could be considered as the independent variable and that 
of alloy steel as the dependent variable. 

4.3 Correlation 

Correlation deals with the relationship between two 
factors or variables. The degree or intensity of the linear 
relationship is measured by correlation coefficient. It 
may be mentioned that in the study of correlation, it is 
not the intention to find the effect of one variable over 
the other as in the case of regression analysis but it is 
to find the degree to which the variables vary together 
owing to influences which affect both of them. 
However, the mere existence of high value of the 
correlation coefficient is not necessarily indicative of 
the underlying relationship between the two variables. 
Such a value can at times be purely accidental, the two 
variables having no connection whatsoever. In such 
cases, the correlation coefficient may be spurious. 

4.4 Before carrying out any regression or correlation
study, it is desirable to look at the scatter diagram to
locate the outlier(s), if any, and eliminate them.

5 REGRESSION ANALYSIS 

5.1 Regression Coefficient 

In a scatter diagram of the type described in 4.1.5 (a) or (b), a
straight line of the form y = a + bx could be fitted to the
observed values, where y is the dependent
variable and x the independent variable. The quantity
a in the above equation represents the value of y when
x = 0, and b denotes the slope of the line and is known
as the regression coefficient, which may be negative or
positive depending on the orientation of the line with
respect to the axes. Physically, b indicates the rate of
increase or decrease in the value of y for unit increase
or decrease in the value of x. The regression line is
also used for prediction purposes. Normally,
extrapolation is not recommended, and when
necessary, it should be used cautiously.

5.1.1 The relationship of the type y = a + bx
encountered in the regression analysis is not generally
reversible and is based on the status of the variables
concerned. Therefore, this type of relationship should
not be used for predicting x for given y. However,
mathematically it is possible to find a relationship of the
type x = a' + b'y, and then the two regression lines intersect
at the point $(\bar{x}, \bar{y})$ in the x, y plane.

5.2 Method of Calculation (Ungrouped Data) 

5.2.1 Let there be n pairs of observations for x and y
corresponding to the items in the sample. For fitting
the regression line the following expressions are then
calculated:

a) Average of x:
   $\bar{x} = \sum x / n$

b) Average of y:
   $\bar{y} = \sum y / n$

c) Corrected sum of squares for x:
   $\sum (x-\bar{x})^2 = \sum x^2 - [(\sum x)^2/n]$

d) Corrected sum of squares for y:
   $\sum (y-\bar{y})^2 = \sum y^2 - [(\sum y)^2/n]$

e) Corrected sum of products:
   $\sum (x-\bar{x})(y-\bar{y}) = \sum xy - [(\sum x)(\sum y)/n]$

NOTE — A suitable proforma as given in Annex A may be
helpful in the above computations.

5.2.2 From the above quantities the regression
coefficient b or b' is calculated as:

$b = \dfrac{\text{Corrected sum of products}}{\text{Corrected sum of squares for } x}$

$b' = \dfrac{\text{Corrected sum of products}}{\text{Corrected sum of squares for } y}$

Also the constant a or a' of the regression equation is
obtained as:

$a = \bar{y} - b\,\bar{x}$
$a' = \bar{x} - b'\,\bar{y}$
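The computations of 5.2.1 and 5.2.2 can be sketched in Python as follows. This is only an illustrative sketch and not part of the standard: the function names `corrected_sums` and `fit_line` are hypothetical, and the data are the 15 pairs of Table 1 (see 5.2.4). Because the sketch does not round b before computing a, its intercept differs slightly from the value quoted in 5.2.4.2, where b is first rounded to 12.4.

```python
def corrected_sums(xs, ys):
    """Return the corrected sums of 5.2.1 c), d) and e) as (Sxx, Syy, Sxy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy


def fit_line(xs, ys):
    """Return (a, b) of the regression line y = a + b x (see 5.2.2)."""
    n = len(xs)
    sxx, _syy, sxy = corrected_sums(xs, ys)
    b = sxy / sxx                          # regression coefficient
    a = sum(ys) / n - b * (sum(xs) / n)    # a = y-bar - b * x-bar
    return a, b


# Table 1: Brinell hardness number (x) and tensile strength in MPa (y)
HARDNESS = [104.2, 106.1, 105.6, 106.3, 101.7, 104.4, 102.0, 103.8,
            104.0, 101.5, 101.9, 100.6, 104.9, 106.2, 103.1]
TENSILE = [268.0, 278.6, 275.0, 281.5, 232.4, 272.2, 227.5, 255.1,
           259.5, 229.0, 233.8, 205.9, 272.0, 280.3, 242.2]

sxx, syy, sxy = corrected_sums(HARDNESS, TENSILE)
a, b = fit_line(HARDNESS, TENSILE)
print(f"Sxx = {sxx:.2f}, Sxy = {sxy:.2f}, b = {b:.2f}, a = {a:.1f}")
```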

5.2.3 When the regression model is not of the linear 
type and involves powers or exponentials, the model 
may be reduced to the linear type with the help of the 
logarithmic transformation. Thereafter, the fitting of 
the regression line is exactly similar to the one 
explained in 5.2.2. 
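As an illustration of the logarithmic transformation, consider an exponential model of the form y = A exp(Bx), which becomes the straight line ln y = ln A + Bx. The sketch below uses synthetic, noise-free data generated from assumed values A = 2 and B = 0.5; the data, the model and the helper name `fit_line` are hypothetical and serve only to show the mechanics.

```python
import math


def fit_line(xs, ys):
    """Least-squares line y = a + b x using the corrected sums of 5.2.1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    return a, b


# Synthetic data from the assumed model y = 2 * exp(0.5 * x)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Fit a straight line to (x, ln y), then transform the intercept back
intercept, slope = fit_line(xs, [math.log(y) for y in ys])
a_hat = math.exp(intercept)   # estimate of A
b_hat = slope                 # estimate of B
print(a_hat, b_hat)
```

With real data the points will not lie exactly on the transformed line, and the fit then minimizes squared errors in ln y rather than in y.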
5.2.4 Example

Table 1 gives the Brinell hardness number and the
tensile strength (expressed in units of megapascals)
for 15 specimens of cold drawn copper. Consider
Brinell hardness number as the independent variable
(x) and tensile strength as the dependent variable (y).
It is intended to fit a regression line to the data.

5.2.4.1 Plotting the data given in Table 1 as a scatter
diagram wherein the Brinell hardness number is
measured along the X-axis and the tensile strength
along the Y-axis, Fig. 2 is obtained, from which the
linear trend of the points is self-evident. For the sake
of better understanding, the regression line applicable
to the data is also drawn in Fig. 2.

Table 1 Hardness and Tensile Strength Values
of Cold Drawn Copper

(Clauses 5.2.4, 5.2.4.1 and 5.2.4.2)

Sl No.    Specimen No.    Brinell Hardness    Tensile Strength, MPa
                                 x                     y
 (1)          (2)               (3)                   (4)

i)             1               104.2                 268.0
ii)            2               106.1                 278.6
iii)           3               105.6                 275.0
iv)            4               106.3                 281.5
v)             5               101.7                 232.4
vi)            6               104.4                 272.2
vii)           7               102.0                 227.5
viii)          8               103.8                 255.1
ix)            9               104.0                 259.5
x)            10               101.5                 229.0
xi)           11               101.9                 233.8
xii)          12               100.6                 205.9
xiii)         13               104.9                 272.0
xiv)          14               106.2                 280.3
xv)           15               103.1                 242.2


5.2.4.2 From the data in Table 1, various computations
are obtained as follows:

$\sum x = 1\,556.3$        $\bar{x} = 103.75$
$\sum y = 3\,813.0$        $\bar{y} = 254.2$
$\sum x^2 = 161\,520.87$
$\sum xy = 396\,226.39$

Corrected sum of squares for x
  $= 161\,520.87 - [(1\,556.3)^2/15]$
  $= 161\,520.87 - 161\,471.31$
  $= 49.56$

Corrected sum of products
  $= 396\,226.39 - [(1\,556.3 \times 3\,813)/15]$
  $= 396\,226.39 - 395\,611.46$
  $= 614.93$

$b = 614.93/49.56 = 12.4$
$a = \bar{y} - b\,\bar{x} = 254.2 - 1\,286.5 = -1\,032.3$

Hence the regression line is obtained as

$y = -1\,032.3 + 12.4\,x$

5.2.4.3 For simplifying the computational work
involved in fitting a regression line, change of origin
is often helpful in one or both the variables. Thus, for
the example worked out in 5.2.4.2, if the variables
x and y are changed to u and v such that u = x - 100
and v = y - 250, then the computations would be as
follows:

$\sum u = 56.3$        $\bar{u} = 3.75$
$\sum v = 63.0$        $\bar{v} = 4.20$
$\sum u^2 = 260.87$
$\sum uv = 851.39$

$\sum (u-\bar{u})^2 = \sum u^2 - [(\sum u)^2/n] = 260.87 - 211.31 = 49.56$

$\sum (u-\bar{u})(v-\bar{v}) = \sum uv - [(\sum u)(\sum v)/n] = 851.39 - 236.46 = 614.93$

$b = 614.93/49.56 = 12.4$ and $\bar{v} - b\,\bar{u} = 4.2 - 46.5 = -42.3$

Hence the regression line is obtained as $v = -42.3 + 12.4\,u$,
which when transformed to the original
variables, comes out as:

$(y - 250) = -42.3 + 12.4\,(x - 100)$
that is $y = -1\,032.3 + 12.4\,x$

NOTE — It would be of interest to observe that the regression
coefficient b is not affected by the change of origin of either or
both the variables.
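The observation in the NOTE can be checked numerically. The following sketch, with the hypothetical helper `slope_and_intercept`, fits the Table 1 data once in the original variables and once after the change of origin u = x - 100, v = y - 250, and then transforms the second intercept back to the original variables.

```python
def slope_and_intercept(xs, ys):
    """Return (a, b) for the line y = a + b x from the corrected sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx
    return sum_y / n - b * sum_x / n, b


# Table 1 data
HARDNESS = [104.2, 106.1, 105.6, 106.3, 101.7, 104.4, 102.0, 103.8,
            104.0, 101.5, 101.9, 100.6, 104.9, 106.2, 103.1]
TENSILE = [268.0, 278.6, 275.0, 281.5, 232.4, 272.2, 227.5, 255.1,
           259.5, 229.0, 233.8, 205.9, 272.0, 280.3, 242.2]

# Change of origin used in 5.2.4.3
us = [x - 100 for x in HARDNESS]
vs = [y - 250 for y in TENSILE]

a_xy, b_xy = slope_and_intercept(HARDNESS, TENSILE)
a_uv, b_uv = slope_and_intercept(us, vs)

# v = a_uv + b_uv * u  is equivalent to  y = (a_uv + 250 - 100 * b_uv) + b_uv * x
a_back = a_uv + 250 - 100 * b_uv
print(b_xy, b_uv, a_xy, a_back)
```

The two slopes agree up to floating-point rounding, and the back-transformed intercept matches the intercept fitted in the original variables.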



[Figure: tensile strength (y), scale 200 to 300, plotted against Brinell Hardness No. (x), scale 100 to 108, with the fitted regression line]
Fig. 2 Scatter Diagram Along with the Regression Line
5.2.4.4 From this equation the expected value of tensile
strength for any given Brinell hardness number could
be obtained. Thus, when the hardness number is known
to be 105, the corresponding expected value of tensile
strength would be 269.7 megapascals.

5.2.5 Construction of Confidence Limits for the 
Regression Line 

The model is $y = \alpha + \beta x + \text{error}$.

The estimates b of $\beta$ and a of $\alpha$ are obtained for the
example as:

a = -1 032.3, b = 12.4

The error variance ($\hat{\sigma}_e^2$), that is, the error sum of squares
divided by its degrees of freedom, is given by:

$\hat{\sigma}_e^2 = \sum (y - a - bx)^2/(15 - 2) = 30.813$

For a particular value of X = x (Brinell hardness = x),
the predicted value of the tensile strength ($\hat{y}$) is:

$\hat{y} = -1\,032.3 + 12.4\,x$

The standard deviation of $\hat{y}$ given X = x is

$s(\hat{y}/x) = \hat{\sigma}_e\,[(1/n) + \{(x-\bar{x})^2/\sum (x-\bar{x})^2\}]^{1/2}$

Therefore, for a given x the confidence limits on the
value of y are

$\hat{y} \pm t\,s(\hat{y}/x)$

where t is the value of a t-distribution with (n - 2 = 13)
degrees of freedom.

Since we are interested in the confidence limits for the
whole of the regression line, these limits for individual
y have to be widened. The appropriate multiplier is
$(2F)^{1/2}$, where F is the upper 5 percent point of the F
distribution with degrees of freedom $\nu_1 = 2$ and
$\nu_2 = (n - 2) = (15 - 2) = 13$. From the tables of F, the
value of F(2, 13) at 5 percent level of significance is
3.80. So the multiplier is $(2 \times 3.8)^{1/2} = 2.76$.

So the confidence limits for the regression line are
given by:

$a + bx + 2.76\,\sqrt{30.813\,\{(1/15) + (x-\bar{x})^2/49.56\}}$

and

$a + bx - 2.76\,\sqrt{30.813\,\{(1/15) + (x-\bar{x})^2/49.56\}}$

The confidence limits of the regression line for
the data given in Table 1 have been calculated from
the above expressions and are given in Table 2.

NOTE — The upper and lower limits for the regression line
form a hyperbolic curve. When x is close to $\bar{x}$ the contribution
of the term $(x-\bar{x})^2$ is small. As x deviates from $\bar{x}$, the
contribution of this term increases.
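The limits of Table 2 can be reproduced from the quantities quoted above. The sketch below takes a, b, the mean of x, the corrected sum of squares, the error variance and the multiplier 2.76 directly from this clause, already rounded as printed. The function name `confidence_band` is hypothetical.

```python
import math

# Values quoted in 5.2.4.2 and 5.2.5 (rounded as in the text)
A_HAT, B_HAT = -1032.3, 12.4
X_BAR = 103.75
SXX = 49.56               # corrected sum of squares for x
N = 15
ERROR_VARIANCE = 30.813
MULTIPLIER = 2.76         # (2 F)^(1/2) with F(2, 13) = 3.80


def confidence_band(x):
    """Return (predicted y, upper limit, lower limit) for the regression line at x."""
    y_hat = A_HAT + B_HAT * x
    half_width = MULTIPLIER * math.sqrt(
        ERROR_VARIANCE * (1 / N + (x - X_BAR) ** 2 / SXX)
    )
    return y_hat, y_hat + half_width, y_hat - half_width


for x in (100.6, 103.8, 106.3):
    y_hat, upper, lower = confidence_band(x)
    print(f"x = {x}: y = {y_hat:.2f}, upper = {upper:.2f}, lower = {lower:.2f}")
```

The band is narrowest near the mean of x and widens towards the ends of the observed range, which is the behaviour described in the NOTE.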

5.3 Method of Calculation (Grouped Data) 

5.3.1 Sometimes, the observations on the two variables
x and y are presented in the form of a frequency table.
In such situations the range of each variate is divided
into a number of class intervals of equal width (say $l_x$
for p classes of the independent variable x and $l_y$ for q
classes of the dependent variable y). The class widths for x
and y need not be equal, and the frequency $f_{ij}$ in the
cell is determined by the ith class interval of the first
variate and the jth class interval of the second variate. This
would result in a bivariate frequency distribution table
(see Annex B).

Table 2 Confidence Limits for Regression Line

(Clause 5.2.5)

   x          y       Upper Limit    Lower Limit
  (1)        (2)          (3)            (4)

 100.6     215.14       223.05         207.23
 101.5     226.30       232.59         220.01
 102.0     232.50       237.99         227.01
 103.1     246.14       250.36         241.91
 103.8     254.82       258.78         250.85
 104.2     259.78       263.86         255.70
 104.9     268.46       273.14         263.78
 105.6     277.14       282.78         271.50
 106.3     285.82       292.63         279.01


5.3.2 As a first step for calculating the regression line,
another proforma (see Annex C) is to be prepared.

5.3.3 The different entries in the above proforma are
explained below:

a) In the top row are given the mid-values of
the class intervals for the independent variable
x, whereas in the first column are given the mid-values
of the class intervals for the dependent
variable y. In the column $f_y$ are given the total
frequencies of the corresponding rows,
whereas in the row $f_x$ are given the totals
of the frequencies in the corresponding columns.

b) In the row corresponding to u are given the
transformed values of x, which are obtained
by subtracting an arbitrary quantity $x_0$
(preferably the value of x closest to the median) from
each of the mid-values of the class intervals
for the x variate and dividing these differences by
the width of the class intervals for the x variate.
That is, $u = (x - x_0)/l_x$, where $l_x$ is the width of
the class interval for x. A similar transformed
variable v is given for the variate y in the
respective column, $v = (y - y_0)/l_y$.

c) The next two rows, namely, $u f_x$ and $u^2 f_x$, are
self-explanatory. So also the two columns
corresponding to $v f_y$ and $v^2 f_y$.

d) The row corresponding to V is obtained as the
sum of the products of v and the
corresponding frequency in the column. So
also the column corresponding to U consists
of entries obtained as the sum of the products of u
with the corresponding frequency in the row.
The last row uV, as also the last column vU,
are self-explanatory.



5.3.4 To ensure the correctness of computations it
would be necessary to verify the following checks from
the above proforma:

a) The total frequency of the row corresponding
to $f_x$ should be equal to the total frequency of
the column corresponding to $f_y$.

b) The total of the row corresponding to V should
be equal to the total of the column
corresponding to $v f_y$.

c) The total of the column corresponding to U
should be equal to the total of the row
corresponding to $u f_x$.

d) The sum of the last row corresponding to uV
should be equal to the sum of the last column
corresponding to vU.

5.3.5 After the computations using the above proforma,
the regression coefficient is calculated by the following
formula:

$b = \dfrac{[\sum uV - \{(\sum U)(\sum V)/n\}]\; l_y}{[\sum u^2 f_x - \{(\sum u f_x)^2/n\}]\; l_x}$

The constant of the regression line is obtained as:

$a = \bar{y} - b\,\bar{x} = \{y_0 + l_y\,(\sum v f_y/n)\} - b\,\{x_0 + l_x\,(\sum u f_x/n)\}$

5.3.6 Example

Table 3 gives the distribution of 82 small and medium
size sugar factories by the quantity of cane crushed (x)
and the quantity of sugar produced (y). Fit a regression
line.

As a first step, the computations are made in Table 4,
with $x_0 = 140$, $l_x = 20$, $y_0 = 12$ and $l_y = 2$.

The regression coefficient is computed as:

$b = \dfrac{[329 - \{(-78)(-35)/82\}] \times 2}{[410 - \{(-78)^2/82\}] \times 20} = 0.088\,1 \approx 0.09$

The constant of the regression line is obtained as:

$a = \{12 - (2 \times 35/82)\} - 0.088\,1\,\{140 - (20 \times 78/82)\}$
$\;\; = 11.15 - 0.088\,1 \times 120.98 = 11.15 - 10.66$
$\;\; = 0.49$

Hence the regression line is obtained as:

$y = 0.49 + 0.09\,x$

This line can be used for predicting the quantity of
sugar produced knowing the quantity of the cane
crushed. Since x and y are expressed in units of 1 000 tonne,
if 100 thousand tonne of cane is crushed, then by
the above equation, 9.5 thousand tonne of sugar can be expected
to be produced.

NOTE — For the purpose of prediction, the regression line may
be used only within the range of the independent variable and
in the vicinity of the terminal values.
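The grouped-data procedure of 5.3.3 to 5.3.5 can be sketched as follows for the frequencies of Table 3. This is an illustrative sketch only: the function name `grouped_fit` and the variable names are hypothetical, the origins 140 and 12 and the class widths 20 and 2 are those appearing in the worked computation above, and the four checks of 5.3.4 are written as assertions.

```python
import math

X_MID = [40, 60, 80, 100, 120, 140, 160, 180, 200, 220]   # cane crushed, 1 000 tonne
Y_MID = [4, 6, 8, 10, 12, 14, 16, 18, 20, 22]             # sugar produced, 1 000 tonne
# FREQ[i][j]: number of factories in y-class i and x-class j (Table 3)
FREQ = [
    [2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 6, 2, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 8, 4, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 11, 6, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 7, 7, 2, 1, 0, 0],
    [0, 0, 0, 0, 0, 9, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]


def grouped_fit(freq, x_mid, y_mid, x0, lx, y0, ly):
    """Return (a, b) of y = a + b x from a bivariate frequency table."""
    u = [(x - x0) / lx for x in x_mid]
    v = [(y - y0) / ly for y in y_mid]
    f_x = [sum(row[j] for row in freq) for j in range(len(u))]   # column totals
    f_y = [sum(row) for row in freq]                             # row totals
    n = sum(f_x)
    big_u = [sum(f * uj for f, uj in zip(row, u)) for row in freq]   # column U
    big_v = [sum(freq[i][j] * v[i] for i in range(len(v)))
             for j in range(len(u))]                                 # row V
    sum_uf = sum(uj * f for uj, f in zip(u, f_x))
    sum_u2f = sum(uj * uj * f for uj, f in zip(u, f_x))
    sum_vf = sum(vi * f for vi, f in zip(v, f_y))
    sum_uv = sum(uj * vj for uj, vj in zip(u, big_v))                # total of row uV
    sum_vu = sum(vi * ui for vi, ui in zip(v, big_u))                # total of column vU
    # Checks a) to d) of 5.3.4
    assert sum(f_x) == sum(f_y)
    assert math.isclose(sum(big_v), sum_vf)
    assert math.isclose(sum(big_u), sum_uf)
    assert math.isclose(sum_uv, sum_vu)
    b = (sum_uv - sum(big_u) * sum(big_v) / n) * ly / (
        (sum_u2f - sum_uf ** 2 / n) * lx)
    a = (y0 + ly * sum_vf / n) - b * (x0 + lx * sum_uf / n)
    return a, b


a, b = grouped_fit(FREQ, X_MID, Y_MID, x0=140, lx=20, y0=12, ly=2)
print(f"b = {b:.4f}, a = {a:.2f}")
```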

5.4 Testing for Regression Coefficient

5.4.1 From the way the regression coefficient is to be 
calculated, it is obvious that its value depends on the 
sample observations. Hence if a new set of observations 
is obtained from the same population and the 
corresponding regression coefficient is calculated, it 
may not necessarily be the same as earlier one. Because 
of this fluctuation it is necessary to test whether the 
regression coefficient as calculated from the sample 
observations differs significantly from some specified 
value which may correspond to the entire population. 
Sometimes, the specified value may also be a rounded 
off value which seems more feasible as the population 
regression coefficient. However, for any testing of the 
regression coefficient to be valid, it is assumed that 
both the independent and the dependent variables 



Table 3 Frequency Distribution of Sugar Factories by Cane Crushed and Sugar Produced

(Clause 5.3.6)

Sl     Sugar Produced                        Cane Crushed in 1 000 tonne (x)
No.    in 1 000 tonne (y)   30-49  50-69  70-89  90-109  110-129  130-149  150-169  170-189  190-209  210-229
(1)         (2)              (3)    (4)    (5)    (6)      (7)      (8)      (9)     (10)     (11)     (12)

i)        3.0- 4.9            2      -      -      -        -        -        -        -        -        -
ii)       5.0- 6.9            -      6      2      1        -        -        -        -        -        -
iii)      7.0- 8.9            -      -      8      4        1        -        -        -        -        -
iv)       9.0-10.9            -      -      -     11        6        1        -        -        -        -
v)       11.0-12.9            -      -      -      1        7        7        2        1        -        -
vi)      13.0-14.9            -      -      -      -        -        9        4        -        -        -
vii)     15.0-16.9            -      -      -      -        -        -        1        -        1        -
viii)    17.0-18.9            -      -      -      -        -        -        -        1        2        -
ix)      19.0-20.9            -      -      -      -        -        -        -        -        1        1
x)       21.0-22.9            -      -      -      -        -        -        -        -        1        1
Table 4 Proforma for Computation of Regression Line for the Data Given in Table 3

(Clause 5.3.6)

y \ x      40   60   80  100  120  140  160  180  200  220    f_y     v   v f_y  v² f_y     U    vU

 4.0        2    -    -    -    -    -    -    -    -    -      2    -4     -8      32    -10    40
 6.0        -    6    2    1    -    -    -    -    -    -      9    -3    -27      81    -32    96
 8.0        -    -    8    4    1    -    -    -    -    -     13    -2    -26      52    -33    66
10.0        -    -    -   11    6    1    -    -    -    -     18    -1    -18      18    -28    28
12.0        -    -    -    1    7    7    2    1    -    -     18     0      0       0     -5     0
14.0        -    -    -    -    -    9    4    -    -    -     13     1     13      13      4     4
16.0        -    -    -    -    -    -    1    -    1    -      2     2      4       8      4     8
18.0        -    -    -    -    -    -    -    1    2    -      3     3      9      27      8    24
20.0        -    -    -    -    -    -    -    -    1    1      2     4      8      32      7    28
22.0        -    -    -    -    -    -    -    -    1    1      2     5     10      50      7    35

f_x         2    6   10   17   14   17    7    2    5    2     82          -35     313    -78   329
u          -5   -4   -3   -2   -1    0    1    2    3    4
u f_x     -10  -24  -30  -34  -14    0    7    4   15    8    -78
u² f_x     50   96   90   68   14    0    7    8   45   32    410
V          -8  -18  -22  -22   -8    8    6    3   17    9    -35
uV         40   72   66   44    8    0    6    6   51   36    329



follow a normal distribution. For further details of the 
normal distribution, see IS 9300 (Part 1). 

5.4.2 Ungrouped Data 

The value of the regression coefficient as calculated
from the sample is an estimate of the true regression
coefficient for the entire population. To judge whether
the population regression coefficient differs
significantly from a specified value, $\beta_0$, the null
hypothesis, $H_0: \beta = \beta_0$, is tested against the alternative
hypothesis, $H_1: \beta \neq \beta_0$, by computing the following
test-statistic:

$t = \dfrac{(b - \beta_0)\,[\sum (x-\bar{x})^2]^{1/2}}{[\{\sum (y-\bar{y})^2 - b \sum (x-\bar{x})(y-\bar{y})\}/(n-2)]^{1/2}}$

where

b = regression coefficient computed from the data,
$\beta_0$ = specified value of the regression coefficient,
$\sum (x-\bar{x})^2$ = corrected sum of squares for x,
$\sum (y-\bar{y})^2$ = corrected sum of squares for y,
$\sum (x-\bar{x})(y-\bar{y})$ = corrected sum of products, and
n = sample size.

The calculated value of t shall be compared with the
tabulated value of t [see Annex B of IS 6200 (Part 1)]
at the desired level of significance (normally 5 percent)
and for (n - 2) degrees of freedom. If the absolute
value of the calculated t is greater than or equal to the
tabulated value, the null hypothesis is rejected and the alternative
hypothesis that the population regression coefficient
is significantly different from the specified value $\beta_0$
is accepted, otherwise not.

As a particular case, when $\beta_0 = 0$ the test would be
used to verify the assumption that the change in the
independent variable does not affect the dependent
variable in the population in a systematic manner.

5.4.2.1 Example

In the illustration given in 5.2.4 concerning the
regression equation for predicting the tensile strength
from Brinell hardness number for cold drawn copper,
it may be of interest to test whether the population
regression coefficient is significantly different from
the specified value of 11.0, which was found to hold
good in an earlier investigation done on a large number
of samples.

In this case, $H_0: \beta = 11.0$ and $H_1: \beta \neq 11.0$

The t-statistic is calculated as follows:

$t = (12.4 - 11.0)\,(49.56)^{1/2}/[\{8\,030.90 - (12.4 \times 614.93)\}/13]^{1/2}$
$\;\; = 1.4 \times 7.04/5.59 = 1.76$

The tabulated value of t at 5 percent level of
significance and for 13 degrees of freedom is given as
2.160. Since the calculated value is less than 2.160,
$H_0$ is not rejected and it is concluded that the population
regression coefficient is not significantly different
from 11.0.
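The test of 5.4.2 can be sketched as a small function. The name `slope_t_statistic` is hypothetical; the inputs are the rounded quantities used in the worked example, with the corrected sum of squares for y taken as 8 030.90 (the value obtained from the Table 1 data) and the tabulated t of 2.160 taken from the text rather than computed.

```python
import math


def slope_t_statistic(b, beta0, sxx, syy, sxy, n):
    """t-statistic of 5.4.2 for H0: beta = beta0, with (n - 2) degrees of freedom."""
    residual_variance = (syy - b * sxy) / (n - 2)
    return (b - beta0) * math.sqrt(sxx) / math.sqrt(residual_variance)


# Quantities of the copper example (5.2.4.2), rounded as in the text
t = slope_t_statistic(b=12.4, beta0=11.0, sxx=49.56, syy=8030.90,
                      sxy=614.93, n=15)

T_TABULATED = 2.160                     # 5 percent level, 13 degrees of freedom
reject_h0 = abs(t) >= T_TABULATED
print(f"t = {t:.2f}, reject H0: {reject_h0}")
```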

5.4.3 Grouped Data 

In the case of grouped data, for testing whether the
population regression coefficient is significantly
different from the specified value $\beta_0$, the null
hypothesis, $H_0: \beta = \beta_0$, is tested against the alternative
hypothesis, $H_1: \beta \neq \beta_0$, and the t-statistic is computed
as:

$t = A/\sqrt{B}$

where

$A = (b - \beta_0)\,[\sum u^2 f_x - \{(\sum u f_x)^2/n\}]^{1/2}\; l_x$

$B = \dfrac{\{\sum v^2 f_y - (\sum v f_y)^2/n\}\; l_y^2 \;-\; b\,\{\sum uV - (\sum U)(\sum V)/n\}\; l_x\, l_y}{n - 2}$

The testing can be done on the same lines as indicated
in 5.4.2.

5.4.3.1 In the example given in 5.3.6 for predicting
the quantity of sugar produced from the quantity of
cane crushed, it may be of interest to examine whether
the population regression coefficient is significantly
different from zero.

In this case, $H_0: \beta = 0$; $H_1: \beta \neq 0$

The t-statistic is computed as follows:

$A = 0.09\,[410 - \{(-78)^2/82\}]^{1/2} \times 20 = 32.985\,0$

$B = [\{313 - (-35)^2/82\} \times 4 - 0.09\,\{329 - (-78)(-35)/82\} \times 20 \times 2]/80 = 1.596\,2$

$t = 32.985\,0/\sqrt{1.596\,2} = 32.985\,0/1.263\,4 = 26.1$

Since the tabulated value of t at 5 percent level of
significance and 80 degrees of freedom is given as 1.96,
and the calculated value exceeds it,
$H_0$ is rejected and it is concluded that the population
regression coefficient is significantly different from
zero.

6 CORRELATION 

6.1 Correlation Coefficient

6.1.1 The correlation coefficient is usually denoted by
the symbol $\rho$ with respect to the population under
study. When the study is based on a sample drawn
from a population, it is denoted by the symbol r. Values
of the correlation coefficient lie between -1 and +1. If
it is +1, perfect positive correlation exists between the
two factors under study, which implies that a definite
increase in one factor is accompanied by a
proportionate increase in the other factor.

6.1.2 If r = -1 then perfect negative correlation is
present, meaning thereby that a definite increase in one
factor is accompanied by a proportionate decrease in the other
factor, or vice versa. If the correlation coefficient is
zero then the two factors are said to be uncorrelated.
The correlation coefficient is a pure number and its
magnitude is unaffected by the scale in which the two
variables x and y are measured.

6.1.3 Figure 3 gives the scatter diagram for the two
variables x and y in three situations, namely, when the
correlation coefficient is high positive (say, 0.9), zero
and high negative (say, - 0.9).

6.2 Method of Calculation (Ungrouped Data) 

6.2.1 Let there be n paired observations x and y 
corresponding to the items in the sample. The average 
of X, average of y, corrected sum of squares for x, 
corrected sum of squares for y and corrected sum of 
products are then calculated as given in 5.2.1. 

6.2.2 From the above quantities the correlation
coefficient is calculated as follows:

$r = \dfrac{\text{Corrected sum of products}}{[(\text{Corrected sum of squares for } x) \times (\text{Corrected sum of squares for } y)]^{1/2}}$

6.2.3 When the status of the two variables is not
known (see 5.1.1), the two regression coefficients
obtained in fitting the lines y = a + bx and
x = a' + b'y and the correlation coefficient r are
related as $r^2 = b\,b'$.
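The formula of 6.2.2 and the relation between r, b and b' can be checked numerically. The sketch below is illustrative only: it reuses the hardness and tensile strength data of Table 1 (chosen here for convenience, not because the standard uses them for correlation), and the function name `corrected_sums` is hypothetical.

```python
import math


def corrected_sums(xs, ys):
    """Return (Sxx, Syy, Sxy) as defined in 5.2.1 c), d) and e)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy


# Table 1 data
HARDNESS = [104.2, 106.1, 105.6, 106.3, 101.7, 104.4, 102.0, 103.8,
            104.0, 101.5, 101.9, 100.6, 104.9, 106.2, 103.1]
TENSILE = [268.0, 278.6, 275.0, 281.5, 232.4, 272.2, 227.5, 255.1,
           259.5, 229.0, 233.8, 205.9, 272.0, 280.3, 242.2]

sxx, syy, sxy = corrected_sums(HARDNESS, TENSILE)
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient (6.2.2)
b = sxy / sxx                    # regression coefficient of y on x
b_prime = sxy / syy              # regression coefficient of x on y
print(f"r = {r:.3f}, r^2 = {r * r:.4f}, b * b' = {b * b_prime:.4f}")
```

Since b and b' share the numerator of r, their product equals the square of r whatever the data.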

6.2.4 Example 

An investigation was carried out on 4-litre paint tins 

-for finding the correlation between the capacity as 

calculated from the base dimensions and height and 



Y 
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the actual measured capacity with a view to reducing 
the testing for the latter characteristic which was more 
time consuming as compared to dimensional checking. 
Table 5 gives the data obtained on 35 such tins. From 
the data tabulated in Table 5 the various computations 
are obtained as follows: 

Σx = 165.930          x̄ = 4.74 
Σy = 158.095          ȳ = 4.517 
Σx² = 786.678 154 
Σy² = 714.131 775 
Σxy = 749.518 555 

Corrected sum of squares for x = 0.027 728 
Corrected sum of squares for y = 0.016 660 
Corrected sum of products = 0.012 745 

Hence the correlation coefficient is equal to 

$$ r = \frac{0.012\,745}{(0.027\,728 \times 0.016\,660)^{1/2}} = 0.59 $$
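
The arithmetic of this example can be retraced from the quoted totals alone. The following is a minimal sketch, assuming that each corrected sum is the raw sum less the usual correction term (product of the totals divided by n), which is the form the quoted figures are consistent with; the variable names are illustrative.

```python
import math

# Totals quoted in 6.2.4 for the 35 tins of Table 5
n = 35
sum_x, sum_y = 165.930, 158.095
sum_x2, sum_y2 = 786.678154, 714.131775
sum_xy = 749.518555

# Corrected sums: raw sum minus (product of the totals) / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

r = sxy / math.sqrt(sxx * syy)
print(f"Sxx={sxx:.6f} Syy={syy:.6f} Sxy={sxy:.6f} r={r:.2f}")
```

Because the corrected sums are small differences of large totals, the totals need to be carried to six decimal places, as the standard does; rounding them earlier would change r noticeably.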

NOTE — It may be of interest to observe that the correlation 
coefficient r is not affected by a change of origin and scale 
for either or both of the variables. Hence the computations given 
in the above example can be simplified considerably by making 
the following transformations: 
u = (x − 4.700) × 1 000 
v = (y − 4.500) × 1 000 
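
The invariance stated in the NOTE can be illustrated on a few rows of Table 5. This is only a sketch: the helper `corr` is a hypothetical name, and just the first five tins are used to keep the example short, so the value printed is not the 0.59 obtained for all 35 tins.

```python
import math

def corr(x, y):
    """Correlation coefficient from corrected sums of squares and products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# First five tins of Table 5 (calculated and measured capacity)
x = [4.732, 4.735, 4.756, 4.709, 4.708]
y = [4.530, 4.540, 4.550, 4.540, 4.540]

# Coding suggested in the NOTE: shift the origin, then rescale
u = [(a - 4.700) * 1000 for a in x]
v = [(b - 4.500) * 1000 for b in y]

r_raw, r_coded = corr(x, y), corr(u, v)
print(r_raw, r_coded)  # the two values agree up to rounding error
```

Working with u and v replaces six-decimal totals by small whole numbers, which is the simplification the NOTE has in mind for hand computation.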

6.3 Method of Calculation (Grouped Data) 

6.3.1 If the observations on the two variables x and y 
are presented in the form of a frequency table in which 
the range of each variate is divided into a number of 
class intervals and the frequency f_ij corresponds to 
the cell determined by the ith class interval of the first 
variate and the jth class interval of the second variate, 
then the initial computations for obtaining the 
correlation coefficient are exactly the same as 
given in 5.3.1 and 5.3.2. 

6.3.2 After the necessary tabulation of the initial 
computations (see 5.3.2) is made, the correlation 
coefficient is obtained by the following formula: 



Table 5 Capacities of 4-Litre Paint Tins 

(Clause 6.2.4) 

Sl No.     Calculated Capacity     Measured Capacity 
of Tin             x                       y 
 (1)              (2)                     (3) 
  1              4.732                   4.530 
  2              4.735                   4.540 
  3              4.756                   4.550 
  4              4.709                   4.540 
  5              4.708                   4.540 
  6              4.768                   4.500 
  7              4.726                   4.490 
  8              4.744                   4.510 
  9              4.686                   4.485 
 10              4.693                   4.495 
 11              4.695                   4.480 
 12              4.694                   4.485 
 13              4.692                   4.485 
 14              4.727                   4.490 
 15              4.729                   4.490 
 16              4.745                   4.500 
 17              4.741                   4.500 
 18              4.704                   4.510 
 19              4.741                   4.510 
 20              4.745                   4.515 
 21              4.771                   4.520 
 22              4.774                   4.515 
 23              4.768                   4.510 
 24              4.758                   4.525 
 25              4.772                   4.520 
 26              4.779                   4.550 
 27              4.763                   4.550 
 28              4.757                   4.540 
 29              4.781                   4.550 
 30              4.784                   4.555 
 31              4.758                   4.515 
 32              4.753                   4.520 
 33              4.753                   4.525 
 34              4.732                   4.525 
 35              4.757                   4.530 



$$ r = \frac{\sum uV - (\sum U)(\sum V)/n}{\sqrt{\{\sum u^2 f_{i.} - (\sum u f_{i.})^2/n\}\,\{\sum v^2 f_{.j} - (\sum v f_{.j})^2/n\}}} $$

6.3.3 Example 

Table 6 gives the distribution of 100 casts of steel by 
the percentage of iron in the form of pig iron (x) and 
the lime consumption in quintal per cast (y). As a first 
step the computations as in Table 7 are made. 

The correlation coefficient r is calculated as: 

$$ r = \frac{16 - 76(-88)/100}{\sqrt{\{254 - (76)^2/100\}\,\{264 - (-88)^2/100\}}} = \frac{16 + 66.88}{[(254 - 57.76)(264 - 77.44)]^{1/2}} = 0.43 $$
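
The grouped-data computation can be retraced directly from the cell frequencies of Table 6. The sketch below is illustrative: the coded values u and v are taken as inferred from Table 7 (u = 0 at the class mid-point 37, v = 0 at the class mid-point 212, one unit per class), and the variable names are hypothetical.

```python
import math

# Cell frequencies of Table 6: rows are lime-consumption classes (y),
# columns are pig-iron classes (x); empty cells are entered as 0.
freq = [
    [0, 0, 0, 1, 0, 0, 0],    # 100-124
    [0, 2, 1, 7, 1, 0, 0],    # 125-149
    [1, 1, 6, 6, 1, 2, 2],    # 150-174
    [0, 0, 6, 12, 3, 6, 5],   # 175-199
    [0, 0, 0, 3, 8, 11, 3],   # 200-224
    [0, 0, 1, 1, 2, 3, 1],    # 225-249
    [0, 0, 0, 0, 0, 1, 1],    # 250-274
    [0, 0, 0, 1, 0, 0, 0],    # 275-299
    [0, 0, 0, 0, 0, 1, 0],    # 300-324
]
u = [-3, -2, -1, 0, 1, 2, 3]          # coded x mid-points (22 ... 52)
v = [-4, -3, -2, -1, 0, 1, 2, 3, 4]   # coded y mid-points (112 ... 312)

cells = [(f, u[i], v[j]) for j, row in enumerate(freq) for i, f in enumerate(row)]
n = sum(f for f, _, _ in cells)
sum_fu = sum(f * ui for f, ui, _ in cells)
sum_fv = sum(f * vj for f, _, vj in cells)
sum_fu2 = sum(f * ui * ui for f, ui, _ in cells)
sum_fv2 = sum(f * vj * vj for f, _, vj in cells)
sum_fuv = sum(f * ui * vj for f, ui, vj in cells)

num = sum_fuv - sum_fu * sum_fv / n
den = math.sqrt((sum_fu2 - sum_fu ** 2 / n) * (sum_fv2 - sum_fv ** 2 / n))
r = num / den
print(n, sum_fu, sum_fv, sum_fu2, sum_fv2, sum_fuv, round(r, 2))
```

The six totals printed correspond to the check totals of Table 7, so a transcription slip in the frequency table shows up before r is computed.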



6.4 Testing for Correlation Coefficient 

6.4.1 The correlation coefficient calculated from the 
sample data is an estimate of the correlation coefficient 
applicable to all the items in the population. It may, 
however, sometimes be necessary to test whether the 
population correlation coefficient ρ differs significantly 
from a specified value ρ₀. The tests to be performed 
when ρ₀ is equal to zero and when ρ₀ is a non-zero 
value are slightly different and are given below. 

6.4.2 To judge whether the population correlation 
coefficient differs significantly from zero (that is, ρ₀ = 0), 
the null hypothesis, H₀ : ρ = 0 is tested against the 
alternative hypothesis, H₁ : ρ ≠ 0 by computing the 
Table 6 Frequency Distribution of Percentage of Pig Iron (x) and Lime Consumption (y) 

(Clause 6.3.3) 

Sl No.   Lime Consumption in            Percentage of Pig Iron (x) 
         Quintal per Cast (y)   20-24  25-29  30-34  35-39  40-44  45-49  50-54 
 (1)            (2)              (3)    (4)    (5)    (6)    (7)    (8)    (9) 
 i)           100-124             -      -      -      1      -      -      - 
 ii)          125-149             -      2      1      7      1      -      - 
 iii)         150-174             1      1      6      6      1      2      2 
 iv)          175-199             -      -      6     12      3      6      5 
 v)           200-224             -      -      -      3      8     11      3 
 vi)          225-249             -      -      1      1      2      3      1 
 vii)         250-274             -      -      -      -      -      1      1 
 viii)        275-299             -      -      -      1      -      -      - 
 ix)          300-324             -      -      -      -      -      1      - 



Table 7 Computations 

(Clause 6.3.3) 

  y \ x     22    27    32    37    42    47    52  |  f_.j    v     vf    v²f     U     vU 
  112        -     -     -     1     -     -     -  |    1    -4     -4     16     0      0 
  137        -     2     1     7     1     -     -  |   11    -3    -33     99    -4     12 
  162        1     1     6     6     1     2     2  |   19    -2    -38     76     0      0 
  187        -     -     6    12     3     6     5  |   32    -1    -32     32    24    -24 
  212        -     -     -     3     8    11     3  |   25     0      0      0    39      0 
  237        -     -     1     1     2     3     1  |    8     1      8      8    10     10 
  262        -     -     -     -     -     1     1  |    2     2      4      8     5     10 
  287        -     -     -     1     -     -     -  |    1     3      3      9     0      0 
  312        -     -     -     -     -     1     -  |    1     4      4     16     2      8 
  ------------------------------------------------------------------------------------------ 
  f_i.       1     3    14    31    15    24    12  |  100     -    -88*   264    76*    16* 
  u         -3    -2    -1     0     1     2     3  | 
  uf        -3    -6   -14     0    15    48    36  |   76* 
  u²f        9    12    14     0    15    96   108  |  254 
  V         -2    -8   -20   -45    -6    -1    -6  |  -88* 
  uV         6    16    20     0    -6    -2   -18  |   16* 

  * Check totals: Σuf = ΣU = 76, Σvf = ΣV = −88 and ΣuV = ΣvU = 16. 


following statistic: 

$$ t = \frac{r\,(n-2)^{1/2}}{(1-r^2)^{1/2}} $$

where r is the correlation coefficient as computed from 
the sample and n is the sample size. 

The value of t so calculated shall be compared with the 
tabulated value of t [see Annex B of IS 6200 (Part 1)] at 
the desired level of significance (normally 5 percent) 
and for (n − 2) degrees of freedom. 

If the calculated value of t is greater than or equal to 
the tabulated value, H₀ is rejected and the population 
correlation coefficient is said to be significantly 
different from zero, meaning thereby that the two 
factors under consideration are correlated. However, 
if the calculated value of t is less than the tabulated 
value, H₀ is not rejected, which indicates that the sample 
data do not show any evidence that the factors under 
consideration are correlated. 

For some selected values of the sample size n, the tabulated 
values of r have been calculated from the critical values of t 
at the 5 percent and 1 percent levels of significance and are 
given in Annex D. If the calculated value of the correlation 
coefficient is less than the tabulated value, the 
null hypothesis is not rejected; otherwise it is rejected. 

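The conversion from a critical value of t to a critical value of r follows from solving the statistic of 6.4.2 for r, which gives r = t/(t² + n − 2)^{1/2}. The sketch below illustrates that algebra only; the function names are hypothetical, the value 2.306 is the two-sided 5 percent point of t for 8 degrees of freedom taken from standard t tables, and the result has not been checked against the entries of Annex D, whose basis is not restated here.

```python
import math

def t_statistic(r, n):
    """Statistic of 6.4.2: t = r * sqrt(n - 2) / sqrt(1 - r**2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def critical_r(t_crit, n):
    """Value of r at which the statistic of 6.4.2 equals t_crit."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

n = 10
t_crit = 2.306   # assumed input: 5 percent two-sided t value, 8 degrees of freedom
r_crit = critical_r(t_crit, n)

# Round trip: feeding r_crit back into the statistic recovers t_crit
print(round(r_crit, 4), round(t_statistic(r_crit, n), 3))
```
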
6.4.2.1 Example 

In the illustration given in 6.3.3, wherein the correlation 
coefficient between the percentage of pig iron and the lime 
consumption in quintal per cast was computed as 0.43, 
if it is intended to test whether the population correlation 
coefficient is significantly different from zero, the null 
hypothesis is H₀ : ρ = 0 and the alternative hypothesis 
is H₁ : ρ ≠ 0. The t-statistic is computed as: 

$$ t = \frac{r\,(n-2)^{1/2}}{(1-r^2)^{1/2}} = \frac{0.43\,\sqrt{98}}{(0.815\,1)^{1/2}} = 4.71 $$

Since the tabulated value of the t distribution with 
98 degrees of freedom at the 5 percent level of 
significance is about 1.96, H₀ is rejected and it is 
concluded that the population correlation coefficient 
is significantly different from zero, that is, the variables 
are associated to a significant extent. 
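
The test of this example can be sketched in a few lines. The function name is hypothetical, the tabulated value 1.96 is the approximate figure used in the text, and taking the absolute value of t (so that a negative r is handled the same way) is an assumption made here for a two-sided test.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r**2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

r, n = 0.43, 100          # values from the example in 6.3.3
t = t_statistic(r, n)
t_tabulated = 1.96        # approximate 5 percent value used in the text
reject_h0 = abs(t) >= t_tabulated
print(round(t, 2), reject_h0)
```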



6.4.3 For judging whether the population correlation 
coefficient differs significantly from the specified value 
ρ₀ (other than zero), the null hypothesis, H₀ : ρ = ρ₀ 
shall be tested against the alternative hypothesis, H₁ : 
ρ ≠ ρ₀. The sample correlation coefficient r and the 
specified value ρ₀ shall be transformed into z and z₀ 
with the help of Annex E and the following statistic 
shall be computed: 

$$ |z - z_0|\,(n-3)^{1/2} $$

where |z − z₀| denotes the value of the difference 
between z and z₀ ignoring the sign. 

If the value of this statistic is less than or equal to 1.96 
(corresponding to the 5 percent level of significance of 
the normal deviate), then H₀ is not rejected, which 
indicates that the population correlation coefficient is 
not significantly different from ρ₀. In case the 
calculated value of the normal deviate is more than 
1.96, H₀ is rejected, which indicates that the population 
correlation coefficient is significantly different from 
the specified value ρ₀. 

NOTE — If the level of significance chosen is 1 percent then 
instead of 1.96 the value 2.58 is to be used in the above 
comparison. 

6.4.3.1 Example 

In the illustration given under 6.2.4, wherein the 
correlation coefficient between the calculated capacity 
and the measured capacity of 4-litre paint tins was 
computed as 0.59, it may be of interest to test whether 
the population correlation coefficient is significantly 
different from 0.70. 

In this case, the null hypothesis is H₀ : ρ = ρ₀ and the 
alternative hypothesis is H₁ : ρ ≠ ρ₀, with ρ₀ = 0.70. 

From Annex E, the value of z corresponding to 
r = 0.59 is obtained as 0.677 7 and that of z₀ 
corresponding to ρ₀ = 0.70 is obtained as 0.867 3. 

Hence |z − z₀| (n − 3)^{1/2} = |0.677 7 − 0.867 3| √32 = 
0.189 6 × 5.66 = 1.073 

Since this value is less than 1.96, H₀ is not rejected 
and there is not enough evidence to conclude that the 
population correlation coefficient is significantly 
different from 0.70. 
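
The same test can be sketched without the table look-up, since Annex E tabulates z = tanh⁻¹ r and Python's `math.atanh` computes that function directly. The function name `fisher_z_statistic` is hypothetical.

```python
import math

def fisher_z_statistic(r, rho0, n):
    """|z - z0| * sqrt(n - 3), where z = atanh(r) and z0 = atanh(rho0)."""
    z, z0 = math.atanh(r), math.atanh(rho0)
    return abs(z - z0) * math.sqrt(n - 3)

# Values from the example: r = 0.59 from 35 tins, specified value 0.70
stat = fisher_z_statistic(r=0.59, rho0=0.70, n=35)
reject_h0 = stat > 1.96   # 5 percent point of the normal deviate
print(round(math.atanh(0.59), 4), round(math.atanh(0.70), 4),
      round(stat, 3), reject_h0)
```

For the 1 percent level the constant 1.96 would be replaced by 2.58, as stated in the NOTE to 6.4.3.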



ANNEX A 

(Clause 5.2.1) 

PROFORMA FOR COMPUTATION OF CORRELATION/REGRESSION 

Product: 

Independent variable (x):          Unit of measurement: 

Dependent variable (y):            Unit of measurement: 

Sl No.     x      y      u (from x − x₀)     v (from y − y₀)     u²     v²     uv 
 (1)      (2)    (3)          (4)                 (5)            (6)    (7)    (8) 



Total 

Mean 

NOTE — In case the variables x and y are not transformed to u and v respectively, col 6, 7 and 8 may be utilized for tabulating x², y² and xy. 
ANNEX B 

(Clause 5.3.1) 

BIVARIATE FREQUENCY DISTRIBUTION TABLE 

[Table layout: the columns are the p classes of equal width for the x-variable (i = 1, 2, 3, ..., p), followed by a Total column; the rows are the classes for the y-variable (j = 1, 2, 3, ...), followed by a Total row; the (i, j) cell holds the frequency f_ij and the grand total is n.] 

f_ij = frequency in the (i, j) cell. 
n = total frequency. 
NOTE — x_i is the mid-point of the interval of the ith column and y_j is the mid-point of the interval of the jth row. 



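ANNEX C 

(Clause 5.3.2) 

PROFORMA FOR CALCULATING REGRESSION LINE FOR GROUPED DATA 

[Proforma layout, as used in Table 7: the body holds the cell frequencies, with one row per y class and one column per x class; each row is extended by the columns f, v, vf, v²f, U and vU, and each column is extended by the rows f, u, uf, u²f, V and uV; the totals, including n, are entered in the last row and the last column.] 
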
ANNEX D 

(Clause 6.4.2) 
TABULATED VALUES OF r FOR 5 PERCENT AND 1 PERCENT LEVEL OF SIGNIFICANCE 

                  Calculated Values of r 
  n      5 percent Level       1 percent Level 
         of Significance       of Significance 
  3          1.000                 1.000 
  4          1.000                 1.000 
  5          0.954                 0.985 7 
  6          0.891                 0.955 9 
  7          0.829 4               0.918 8 
  8          0.774 3               0.880 1 
  9          0.726 6               0.842 6 
 10          0.685 5               0.807 6 
 11          0.649 8               0.775 5 
 12          0.618 8               0.746 1 
 13          0.591 5               0.719 3 
 14          0.567 4               0.694 8 
 15          0.545 7               0.672 3 
 16          0.526 5               0.651 8 
 17          0.508 8               0.632 9 
 18          0.493                 0.615 4 
 19          0.478 4               0.599 1 
 20          0.465                 0.584 
 21          0.452 6               0.570 1 
 22          0.441 2               0.556 9 
 23          0.430 7               0.544 7 
 24          0.420 7               0.533 2 
 25          0.411 5               0.522 3 
 26          0.402 8               0.512 2 
 27          0.394 7               0.502 4 
 28          0.387                 0.493 4 
 29          0.379 7               0.484 7 
 30          0.372 7               0.476 4 


ANNEX E 

(Clause 6.4.3) 

THE Z-TRANSFORMATION OF THE CORRELATION COEFFICIENT (z = tanh⁻¹ r) 

  r      0.00      0.01      0.02      0.03      0.04      0.05      0.06      0.07      0.08      0.09 
 0.0    0.000     0.010     0.020     0.030     0.040     0.050     0.060 1   0.070 1   0.080 2   0.090 2 
 0.1    0.100 3   0.110 4   0.120 6   0.130 7   0.140 9   0.151 1   0.161 4   0.171 7   0.182     0.192 3 
 0.2    0.202 7   0.213 2   0.223 7   0.234 2   0.244 8   0.255 4   0.266 1   0.276 9   0.287 7   0.298 6 
 0.3    0.309 5   0.320 5   0.331 6   0.342 8   0.354 1   0.365 4   0.376 9   0.388 4   0.400 1   0.411 8 
 0.4    0.423 6   0.435 6   0.447 7   0.459 9   0.472 2   0.484 7   0.497 3   0.510 1   0.523     0.536 1 
 0.5    0.549 3   0.562 7   0.576 3   0.590 1   0.604 2   0.618 4   0.632 8   0.647 5   0.662 5   0.677 7 
 0.6    0.693 1   0.708 9   0.725     0.741 4   0.758 2   0.775 3   0.792 8   0.810 7   0.829 1   0.848 
 0.7    0.867 3   0.887 2   0.907 6   0.928 7   0.950 5   0.973     0.996     1.020     1.045     1.071 
 0.8    1.099     1.127     1.157     1.188     1.221     1.256     1.293     1.333     1.376     1.422 
 0.9    1.472     1.528     1.589     1.658     1.738     1.832     1.946     2.092     2.298     2.647 
ANNEX F 

(Foreword) 

COMMITTEE COMPOSITION 

Statistical Methods for Quality and Reliability Sectional Committee, MSD 3 

Organization 
Kolkata University, Kolkata 
Bharat Heavy Electricals Limited, Hyderabad 

Continental Devices India Ltd, New Delhi 

Directorate General of Quality Assurance, New Delhi 

Laser Science and Technology Centre, DRDO, New Delhi 

Escorts Limited, Faridabad 

HMT Ltd, R&D Centre, Bangalore 

Indian Agricultural Statistics Research Institute (IASRI), New Delhi 

Indian Association for Productivity, Quality & Reliability (IAPQR), Kolkata 

Indian Institute of Management (IIM), Lucknow 
Indian Statistical Institute (ISI), Kolkata 

National Institution for Quality and Reliability (NIQR), New Delhi 

Powergrid Corporation of India Ltd, New Delhi 

SRF Limited, Chennai 

Standardization, Testing and Quality Certification Directorate (STQCD), New Delhi 

Tata Engineering and Locomotive Co Ltd (TELCO), Jamshedpur 

University of Delhi, Delhi 

In personal capacity (B-109, Malviya Nagar, New Delhi 110017) 

In personal capacity (2(?//, Krishna Nagar, Safdarjang Enclave, New Delhi 110029) 
BIS Directorate General 

Representative(s) 
Prof S. P. Mukherjee (Chairman) 
Shri S. N. Jha 

Shri A. V. Krishnan (Alternate) 
Dr Navin Kapur 

Shri Vipul Gupta (Alternate) 
Shri S. K. Srivastva 

Lt-Col P. Vijayan (Alternate) 
Dr Ashok Kumar 
Shri C. S. V. Narendra 
Shri K. Vijayamma 
Dr S. D. Sharma 

Dr A. K. Srivastava (Alternate) 
Dr B. Das 

Prof S. Chakraborty 